Lab One: Exploring the Data
Team Members
Rongwei Ji, Jonathan Mejia, Xiaoqing Zou
Data Introduction
Data source: https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster
Data file: Lab1/data/ObesityDataSet.csv
17 columns and 2111 rows in total; the dataset contains the categorical target.
Environment
Python 3.11.5, conda 23.9.0, numpy 1.26.3, pandas 2.2.0, pytz 2023.4, tzdata 2023.4, pip 23.3.2, matplotlib 3.8.2
Something More
Kaggle generated a new dataset (about 20758 rows) from the original one using deep learning: https://www.kaggle.com/competitions/playground-series-s4e2/data; we can use it for further work (data/train.csv).
1. Business Understanding
Overview of the Data
The data was gathered through an online platform via a survey where respondents remained anonymous. Its objective is to gauge obesity levels among individuals from Mexico, Peru, and Colombia, spanning ages 14 to 65. This dataset will be utilized to train a model using the 'train.csv' file and subsequently applied for predictions using the 'test.csv' file. Due to the nature of its collection method, there may be inaccuracies present; nevertheless, it serves as valuable material for training and prediction model development.
Purpose of the Data
The dataset comprises 17 attributes that concentrate on various aspects related to obesity. These attributes represent commonly recognized factors associated with obesity. The dataset was gathered to analyze several actions concerning the outcomes of obesity. By utilizing this dataset, we can construct models to predict the likelihood of obesity based on individuals' daily habits. Ultimately, this predictive capability can assist people in avoiding unhealthy behaviors. Alternatively, it can aid in the development of medical interventions or the creation of novel treatment methods.
Prediction for the Data
By making predictions on the data, we can identify the key factors that directly influence obesity. Moreover, we can analyze behavioral data to assess the likelihood of obesity.
Who may benefit from this report?
Anyone who cares about their health and daily habits should be interested in this model and report, since it reveals the relationship between obesity and daily behavior.
How well would the prediction algorithm need to perform to be useful to these third parties?
The model aims to identify the factors influencing obesity and determine their impact, providing estimates of how specific variables contribute to obesity. To gauge its accuracy, the model will be evaluated on held-out test data, since the data source covers the entire survey.
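One way to make "how well" concrete (a sketch, not the lab's actual model): hold out part of the survey as a test set, fit a classifier, and report held-out accuracy. The toy features, toy target, and RandomForest choice here are illustrative assumptions.

```python
# Sketch: quantifying model usefulness with a held-out test set.
# The features, toy target, and classifier below are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({'Age': rng.uniform(14, 65, 300),   # stand-in survey features
                  'FAF': rng.uniform(0, 3, 300)})
y = (X['FAF'] < 1.5).astype(int)  # toy target: low physical activity -> "at risk"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print('held-out accuracy:', accuracy_score(y_test, model.predict(X_test)))
```

With the real dataframe in place of the toy `X`/`y`, the same held-out accuracy would quantify whether the model is useful to third parties.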
Attributes:
Eating-habit-related attributes:
Frequent consumption of high caloric food (FAVC),
Frequency of consumption of vegetables (FCVC),
Number of main meals (NCP),
Consumption of food between meals (CAEC),
Consumption of water daily (CH2O), and Consumption of alcohol (CALC).
Attributes related to physical condition:
Calories consumption monitoring (SCC),
Physical activity frequency (FAF),
Time using technology devices (TUE),
Transportation used (MTRANS)
NObesity values (BMI ranges):
• Underweight: less than 18.5
• Normal: 18.5 to 24.9
• Overweight: 25.0 to 29.9
• Obesity I: 30.0 to 34.9
• Obesity II: 35.0 to 39.9
• Obesity III: higher than 40
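The cut-offs above are BMI thresholds (weight in kg divided by squared height in m). A minimal sketch of the mapping, using the simplified labels from the list (the dataset's own labels additionally split Overweight into two levels):

```python
# BMI = weight (kg) / height (m)^2; categories follow the cut-offs listed above
def bmi_category(weight_kg: float, height_m: float) -> str:
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return 'Underweight'
    if bmi < 25.0:
        return 'Normal'
    if bmi < 30.0:
        return 'Overweight'
    if bmi < 35.0:
        return 'Obesity I'
    if bmi < 40.0:
        return 'Obesity II'
    return 'Obesity III'

print(bmi_category(64.0, 1.62))  # first row of the dataset -> Normal
```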
2. Data Understanding
Importing
First, load and review the data.
# read the data; referencing code from lecture and in-class examples
import pandas as pd
import numpy as np
print('Pandas:', pd.__version__)
print('Numpy:',np.__version__)
df = pd.read_csv('data/ObesityDataSet.csv') # read in the csv file
print(df.info())
df.head()
Pandas: 1.5.3
Numpy: 1.26.2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   object
 1   Age                             2095 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object
 5   FAVC                            2111 non-null   object
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object
 9   SMOKE                           2111 non-null   object
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   object
 15  MTRANS                          2111 non-null   object
 16  NObeyesdad                      2111 non-null   object
dtypes: float64(8), object(9)
memory usage: 280.5+ KB
None
| Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | Normal_Weight |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
| 2 | Male | 23.0 | 1.80 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
| 3 | Male | 27.0 | 1.80 | 87.0 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking | Overweight_Level_I |
| 4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |
Data Types
Generally, every attribute in the dataset should be numerical, and each yes/no attribute will be converted to a binary 1/0 encoding so it suits the model algorithms. During analysis, we can keep the non-numeric types for readability in data visualization. Here we check the non-numerical values to see their contents and consider how to design the encoding. Let's review the values of the non-numerical columns.
# confirm the value sets of Gender and the other non-numeric columns
non_numeric_list=['Gender','FAVC','family_history_with_overweight','CAEC','SMOKE','SCC','CALC','MTRANS','NObeyesdad']
_ = [print(i,df[i].unique()) for i in non_numeric_list] # review the non-numeric values
df.describe() # check the ranges of the numeric columns
Gender ['Female' 'Male']
FAVC ['no' 'yes']
family_history_with_overweight ['yes' 'no']
CAEC ['Sometimes' 'Frequently' 'Always' 'no']
SMOKE ['no' 'yes']
SCC ['no' 'yes']
CALC ['no' 'Sometimes' 'Frequently' 'Always']
MTRANS ['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike']
NObeyesdad ['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II' 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II' 'Obesity_Type_III']
| Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | |
|---|---|---|---|---|---|---|---|---|
| count | 2095.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 |
| mean | 24.310366 | 1.701677 | 86.586058 | 2.419043 | 2.685628 | 2.008011 | 1.010298 | 0.657866 |
| std | 6.369785 | 0.093305 | 26.191172 | 0.533927 | 0.778039 | 0.612953 | 0.850592 | 0.608927 |
| min | 0.164768 | 1.450000 | 39.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 19.926148 | 1.630000 | 65.473343 | 2.000000 | 2.658738 | 1.584812 | 0.124505 | 0.000000 |
| 50% | 22.789402 | 1.700499 | 83.000000 | 2.385502 | 3.000000 | 2.000000 | 1.000000 | 0.625350 |
| 75% | 26.000000 | 1.768464 | 107.430682 | 3.000000 | 3.000000 | 2.477420 | 1.666678 | 1.000000 |
| max | 61.000000 | 1.980000 | 173.000000 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 2.000000 |
Encoding
The numerical attributes are all floats derived from the survey's scoring method, so they can all be treated as continuous.
Among the non-numerical attributes, some are ordinal/ranked, such as CAEC and NObeyesdad, while others are nominal, such as MTRANS.
All attributes in this dataset are assumed to be indicators hypothesized to relate to obesity; for now, we assume there are no redundant or irrelevant indicators.
Although the string values are readable, we convert them to numbers for the model algorithms that follow.
At this point, we can convert part of the data:
CAEC: no, sometimes, frequently, always convert to 0, 1, 2, 3
CALC: no, sometimes, frequently, always convert to 0, 1, 2, 3
NObeyesdad: 'Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II'
'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II'
'Obesity_Type_III' convert to 1, 2, 3, 4, 0, 5, 6
yes/no convert to 1/0
df.replace(to_replace=['no','Sometimes','Frequently','Always','yes'],value=[0,1,2,3,1],inplace=True)
df.replace(to_replace=['Normal_Weight','Overweight_Level_I','Overweight_Level_II','Obesity_Type_I','Insufficient_Weight','Obesity_Type_II','Obesity_Type_III'],value=[1,2,3,4,0,5,6],inplace=True)
df.replace(to_replace=['Female','Male'],value=[0,1],inplace=True)
df.replace(to_replace=['Public_Transportation','Walking', 'Automobile', 'Motorbike', 'Bike'],value=[1,2,3,4,5],inplace=True)
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   int64
 1   Age                             2095 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   int64
 5   FAVC                            2111 non-null   int64
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   int64
 9   SMOKE                           2111 non-null   int64
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   int64
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   int64
 15  MTRANS                          2111 non-null   int64
 16  NObeyesdad                      2111 non-null   int64
dtypes: float64(8), int64(9)
memory usage: 280.5 KB
| Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 21.0 | 1.62 | 64.0 | 1 | 0 | 2.0 | 3.0 | 1 | 0 | 2.0 | 0 | 0.0 | 1.0 | 0 | 1 | 1 |
| 1 | 0 | 21.0 | 1.52 | 56.0 | 1 | 0 | 3.0 | 3.0 | 1 | 1 | 3.0 | 1 | 3.0 | 0.0 | 1 | 1 | 1 |
| 2 | 1 | 23.0 | 1.80 | 77.0 | 1 | 0 | 2.0 | 3.0 | 1 | 0 | 2.0 | 0 | 2.0 | 1.0 | 2 | 1 | 1 |
| 3 | 1 | 27.0 | 1.80 | 87.0 | 0 | 0 | 3.0 | 3.0 | 1 | 0 | 2.0 | 0 | 2.0 | 0.0 | 2 | 2 | 2 |
| 4 | 1 | 22.0 | 1.78 | 89.8 | 0 | 0 | 2.0 | 1.0 | 1 | 0 | 2.0 | 0 | 0.0 | 0.0 | 1 | 1 | 3 |
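The global `df.replace` above works because the category strings happen not to collide across columns; a per-column mapping with `Series.map` is a safer pattern in general. A small sketch on a hypothetical mini-frame (`demo` is illustrative, not the lab data):

```python
import pandas as pd

# hypothetical mini-frame mimicking two of the survey columns
demo = pd.DataFrame({'CALC': ['no', 'Sometimes', 'Always'],
                     'SMOKE': ['no', 'yes', 'no']})

ordinal = {'no': 0, 'Sometimes': 1, 'Frequently': 2, 'Always': 3}
binary = {'no': 0, 'yes': 1}
demo['CALC'] = demo['CALC'].map(ordinal)   # only this column is touched
demo['SMOKE'] = demo['SMOKE'].map(binary)
print(demo.values.tolist())  # -> [[0, 0], [1, 1], [3, 0]]
```

With per-column maps, a string like 'no' can be encoded differently in different columns without any risk of cross-column replacement.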
df_describe=pd.DataFrame()
pd.set_option('max_colwidth',200)
df_describe['Attribute']=df.columns
df_describe['Description']=['Gender','Age','Height','Weight',
                            'Family history with overweight',
                            'Frequent consumption of high caloric food',
                            'Frequency of consumption of vegetables',
                            'Number of main meals',
                            'Consumption of food between meals',
                            'Smoking habit',
                            'Consumption of water daily','Calories consumption monitoring',
                            "Physical activity frequency",
                            "Time using technology devices",
                            "Consumption of alcohol",
                            "Transportation used","Obesity level"]
df_describe['Type']=['Bool,1/0','Interval,float','Ratio,float','Ratio,float','Bool,1/0','Bool,1/0','Ratio,float','Ratio,float','Ordinal,int','Bool,1/0','Ratio,float','Bool,1/0','Ratio,float','Ratio,float','Ordinal,int','Nominal,int','Ordinal,int']
df_describe['Conversion/Range']=['1=male,0=female','14-65','1.45-1.98','39-173','yes=1,no=0','yes=1,no=0','1-3','1-4','0=no,1=sometimes,2=frequently,3=always','yes=1,no=0','1-3','yes=1,no=0','0-3','0-2','0=no,1=sometimes,2=frequently,3=always','Public_Transportation=1,Walking=2,Automobile=3,Motorbike=4,Bike=5','Insufficient_Weight=0,Normal_Weight=1,Overweight_Level_I=2,Overweight_Level_II=3,Obesity_Type_I=4,Obesity_Type_II=5,Obesity_Type_III=6']
df_describe
| Attribute | Description | Type | Conversion/Range | |
|---|---|---|---|---|
| 0 | Gender | Gender | Bool,1/0 | 1=male,0=female |
| 1 | Age | Age | Interval,float | 14-65 |
| 2 | Height | Height | Ratio,float | 1.45-1.98 |
| 3 | Weight | Weight | Ratio,float | 39-173 |
| 4 | family_history_with_overweight | Family history with overweight | Bool,1/0 | yes=1,no=0 |
| 5 | FAVC | Frequent consumption of high caloric food | Bool,1/0 | yes=1,no=0 |
| 6 | FCVC | Frequency of consumption of vegetables | Ratio,float | 1-3 |
| 7 | NCP | Number of main meals | Ratio,float | 1-4 |
| 8 | CAEC | Consumption of food between meals | Ordinal,int | 0=no,1=sometimes,2=frequently,3=always |
| 9 | SMOKE | Smoking habit | Bool,1/0 | yes=1,no=0 |
| 10 | CH2O | Consumption of water daily | Ratio,float | 1-3 |
| 11 | SCC | Calories consumption monitoring | Bool,1/0 | yes=1,no=0 |
| 12 | FAF | Physical activity frequency | Ratio,float | 0-3 |
| 13 | TUE | Time using technology devices | Ratio,float | 0-2 |
| 14 | CALC | Consumption of alcohol | Ordinal,int | 0=no,1=sometimes,2=frequently,3=always |
| 15 | MTRANS | Transportation used | Nominal,int | Public_Transportation=1,Walking=2,Automobile=3,Motorbike=4,Bike=5 |
| 16 | NObeyesdad | Obesity level | Ordinal,int | Insufficient_Weight=0,Normal_Weight=1,Overweight_Level_I=2,Overweight_Level_II=3,Obesity_Type_I=4,Obesity_Type_II=5,Obesity_Type_III=6 |
Finding Duplicates
There are multiple instances of duplication within the dataset. Due to the absence of a unique identification attribute, it's possible for two individuals to have identical values in this survey based on the limited data attributes. Consequently, it's not straightforward to remove these duplicate entries.
# finding the duplicates
dup=df.duplicated()
print("There are {} duplicate instances".format(len(df[dup])))
There are 22 duplicate instances
Finding Missing Values
# Referencing code from lecture and in-class example
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore',DeprecationWarning)
%matplotlib inline
import missingno as mn
print('matplotlib: {}'.format(matplotlib.__version__))
mn.matrix(df)
plt.title("Not sorted", fontsize=22)
mn.matrix(df.sort_values(by=["Age"]))
plt.title("Sorted",fontsize=22)
plt.show()
matplotlib: 3.8.2
The chart displayed indicates several missing age values, potentially because respondents omitted this information. Age is a crucial factor in understanding obesity. These omissions likely stem from errors in the survey process, suggesting a lack of strict validation.
Imputation
To address this issue, we can use height (together with the obesity level) as a proxy to group the data, as these may correlate closely with age.
# Referencing code from lecture and in-class example
# group by height and obesity level, then fill missing values within each group
df_grouped = df.groupby(by=['Height','NObeyesdad'])
df_imputed = df_grouped.transform(lambda grp: grp.fillna(grp.median())) # call median(), not the method object
print('------------')
# the above process can remove columns, so let's find that and fix it
names_removed = list(set(df.columns) - set(df_imputed.columns)) # use list like a set operator
print(names_removed)
print('------------')
df_imputed[names_removed] = df[names_removed]
# Ensure 'Age' is numeric
df_imputed['Age'] = pd.to_numeric(df['Age'], errors='coerce') # Convert non-numeric values to NaN
print(df_imputed.info())
print(df.info())
------------
['NObeyesdad', 'Height']
------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   int64
 1   Age                             2095 non-null   float64
 2   Weight                          2111 non-null   float64
 3   family_history_with_overweight  2111 non-null   int64
 4   FAVC                            2111 non-null   int64
 5   FCVC                            2111 non-null   float64
 6   NCP                             2111 non-null   float64
 7   CAEC                            2111 non-null   int64
 8   SMOKE                           2111 non-null   int64
 9   CH2O                            2111 non-null   float64
 10  SCC                             2111 non-null   int64
 11  FAF                             2111 non-null   float64
 12  TUE                             2111 non-null   float64
 13  CALC                            2111 non-null   int64
 14  MTRANS                          2111 non-null   int64
 15  NObeyesdad                      2111 non-null   int64
 16  Height                          2111 non-null   float64
dtypes: float64(8), int64(9)
memory usage: 280.5 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   int64
 1   Age                             2095 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   int64
 5   FAVC                            2111 non-null   int64
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   int64
 9   SMOKE                           2111 non-null   int64
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   int64
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   int64
 15  MTRANS                          2111 non-null   int64
 16  NObeyesdad                      2111 non-null   int64
dtypes: float64(8), int64(9)
memory usage: 280.5 KB
None
# compared the distribution between the original and imputed
df_imputed.Age.plot(kind='hist', alpha=0.3, label="imputed",bins=100)
df.Age.plot(kind='hist', alpha=0.3, label="original",bins=100)
plt.legend()
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Comparison of Age Distribution')
plt.show()
# because few values are missing, the two distributions are not very different
Finding Outliers
As described, ages should range from 14 to 65, but some values in this dataset fall outside that range. We can delete the unreasonable values, or blank them and then impute them.
# the Age column has several outliers; use the imputed dataframe
# fix this by ensuring that the Age column is converted to a numeric type after imputation.
df_imputed['Age'] = pd.to_numeric(df_imputed['Age'], errors='coerce')
df_imputed.describe()
| Gender | Age | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad | Height | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2111.000000 | 2095.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 |
| mean | 0.505921 | 24.310366 | 86.586058 | 0.817622 | 0.883941 | 2.419043 | 2.685628 | 1.140692 | 0.020843 | 2.008011 | 0.045476 | 1.010298 | 0.657866 | 0.731407 | 1.488394 | 3.112269 | 1.701677 |
| std | 0.500083 | 6.369785 | 26.191172 | 0.386247 | 0.320371 | 0.533927 | 0.778039 | 0.468543 | 0.142893 | 0.612953 | 0.208395 | 0.850592 | 0.608927 | 0.515498 | 0.868475 | 1.985062 | 0.093305 |
| min | 0.000000 | 0.164768 | 39.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.450000 |
| 25% | 0.000000 | 19.926148 | 65.473343 | 1.000000 | 1.000000 | 2.000000 | 2.658738 | 1.000000 | 0.000000 | 1.584812 | 0.000000 | 0.124505 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.630000 |
| 50% | 1.000000 | 22.789402 | 83.000000 | 1.000000 | 1.000000 | 2.385502 | 3.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.625350 | 1.000000 | 1.000000 | 3.000000 | 1.700499 |
| 75% | 1.000000 | 26.000000 | 107.430682 | 1.000000 | 1.000000 | 3.000000 | 3.000000 | 1.000000 | 0.000000 | 2.477420 | 0.000000 | 1.666678 | 1.000000 | 1.000000 | 2.000000 | 5.000000 | 1.768464 |
| max | 1.000000 | 61.000000 | 173.000000 | 1.000000 | 1.000000 | 3.000000 | 4.000000 | 3.000000 | 1.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 3.000000 | 5.000000 | 6.000000 | 1.980000 |
# we can delete the row; because only one row contains this outlier, we believe deleting it won't influence the result
df=df_imputed[df_imputed["Age"]>=14].copy() # .copy() avoids SettingWithCopyWarning in later cells
df.describe()
| Gender | Age | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad | Height | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 | 2094.000000 |
| mean | 0.505253 | 24.321897 | 86.658327 | 0.817574 | 0.884432 | 2.418479 | 2.687019 | 1.138968 | 0.020535 | 2.006187 | 0.045368 | 1.008037 | 0.659139 | 0.732092 | 1.489016 | 3.120821 | 1.701566 |
| std | 0.500092 | 6.349399 | 26.155286 | 0.386287 | 0.319783 | 0.533438 | 0.777198 | 0.465001 | 0.141855 | 0.611807 | 0.208159 | 0.850298 | 0.608284 | 0.514809 | 0.869741 | 1.982295 | 0.093263 |
| min | 0.000000 | 15.000000 | 39.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.450000 |
| 25% | 0.000000 | 19.935311 | 65.934516 | 1.000000 | 1.000000 | 2.000000 | 2.660328 | 1.000000 | 0.000000 | 1.574080 | 0.000000 | 0.120520 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.630000 |
| 50% | 1.000000 | 22.797110 | 83.000000 | 1.000000 | 1.000000 | 2.381934 | 3.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.627973 | 1.000000 | 1.000000 | 3.000000 | 1.700684 |
| 75% | 1.000000 | 26.000000 | 107.850510 | 1.000000 | 1.000000 | 3.000000 | 3.000000 | 1.000000 | 0.000000 | 2.473840 | 0.000000 | 1.661556 | 1.000000 | 1.000000 | 2.000000 | 5.000000 | 1.768385 |
| max | 1.000000 | 61.000000 | 173.000000 | 1.000000 | 1.000000 | 3.000000 | 4.000000 | 3.000000 | 1.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 3.000000 | 5.000000 | 6.000000 | 1.980000 |
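The section above also mentions the alternative of blanking unreasonable ages and imputing them instead of dropping rows; a sketch of that idea on toy values:

```python
import pandas as pd

ages = pd.Series([21.0, 0.16, 27.0, 80.0, 22.0])   # toy values; two out of range
ages = ages.mask((ages < 14) | (ages > 65))        # out-of-range values become NaN
ages = ages.fillna(ages.median())                  # impute with the remaining median
print(ages.tolist())  # -> [21.0, 22.0, 27.0, 22.0, 22.0]
```

This keeps the row count unchanged, which matters when other columns in the affected rows carry useful information.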
Summary of the Data
Now that we've reviewed the data, identified the types of values, and understood the structure of attributes, we've also handled missing data and considered duplicate entries. Next, we can proceed with visualizing the data to draw conclusions and gain insights.
3. Data Visualization
Before we start, we will use visualizations (histograms, diagrams, and plots) to investigate our main question, moving from distributions to correlations to find the key factors.
Distribution
# reference code from the lecture
# Age distribution
plt.style.use('ggplot')
fig=plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
# df is the imputed dataframe
df.Age.plot.hist(bins=20)
plt.xlabel("Age")
plt.title('Histogram, 20 bins')
plt.subplot(1,3,2)
df.Age.plot.kde(bw_method=0.2)
plt.xlabel("Age")
plt.title('Kernel Density, 0.2 bandwidth')
plt.subplot(1,3,3)
plt.xlabel("Age")
df.Age.plot.hist(bins=40)
df.Age.plot.kde(bw_method=0.1, secondary_y=True)
plt.title('Hist and KDE, 40 bins and 0.1 bw')
plt.show()
import matplotlib.pyplot as plt
plt.xlabel("Weight") # Adding x-label
df.Weight.plot.hist(bins=40)
df.Weight.plot.kde(bw_method=0.1, secondary_y=True)
plt.title('Hist and KDE, 40 bins and 0.1 bw')
plt.show()
# height distribution, another type of distribution
plt.figure(figsize=(10,12))
df.Height.plot.hist(bins=50)
plt.axvline(df.Height.mean(), color='black', linewidth=1)
plt.axvline(df.Height.median(), color='blue', linewidth=1)
plt.xlabel("Height")
plt.title('Distribution of height')
plt.show()
# several habit distribution example
# df is the imputed dataframe
df.NCP.plot(kind='hist',bins=50,alpha=0.4,label='NCP')
df.CH2O.plot(kind='hist',bins=50,alpha=0.4,label='CH2O')
df.FAVC.plot(kind='hist',bins=50,alpha=0.4,label='FAVC')
df.FAF.plot(kind='hist',bins=50,alpha=0.4,label='FAF')
plt.title('Habits Histogram, 50 bins')
plt.xlabel("Value")
plt.legend()
plt.show()
df.NObeyesdad.plot(kind='hist',bins=50,alpha=0.4,label='NObeyesdad')
plt.title('NObeyesdad Histogram, 50 bins')
plt.xlabel("Value")
plt.legend()
plt.show()
View
From this histogram, it's evident that the majority of survey participants fall within the younger age groups. This observation suggests that the survey might have been predominantly accessed by younger individuals, possibly due to its online nature. Younger demographics tend to have easier access to the internet and are more likely to prioritize personal health concerns.
This prompts us to inquire about the primary factors contributing to obesity. The pronounced peak in age distribution provides a clear indication. Based on this dataset, it seems prudent to focus interventions primarily on younger demographics, particularly those aged approximately between 14 to 40.
Additionally, the histogram reveals a concentrated weight distribution with a noticeable peak. This suggests that weight values tend to cluster within a certain range, indicating potential patterns or trends in weight distribution among the surveyed population.
Furthermore, when examining features related to habits, it's notable that certain integer values exhibit distinct peaks. Although these features are continuous in nature, the histogram displays spikes at specific integer intervals. This suggests that certain habits or behaviors may be more prevalent among survey respondents, leading to higher frequencies at these discrete values.
The NObeyesdad values are spread fairly evenly across all seven categories, so we can examine the relationship between the attributes and obesity in the correlation analysis that follows.
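The integer spikes noted above can be quantified: compute the share of whole-number entries in a "continuous" habit column. A sketch on a toy FCVC-style series (the values are illustrative):

```python
import pandas as pd

# toy FCVC-style column mixing integer survey answers with interpolated ones
fcvc = pd.Series([2.0, 3.0, 2.0, 2.3855, 1.0, 2.7741])
integer_share = (fcvc == fcvc.round()).mean()
print(f'{integer_share:.0%} of values are whole numbers')
```

A high share of whole numbers suggests the float values come from synthetic interpolation on top of discrete survey answers.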
Correlation
Gender vs Obesity
gender=pd.crosstab(df['NObeyesdad'],df['Gender'])
gender2=pd.crosstab(df['Gender'],df['NObeyesdad'])
gender.plot(kind='bar',figsize=(15,10),stacked=True)
gender2.plot(kind='bar',figsize=(15,10),stacked=True)
plt.title('Gender vs Obesity')
plt.show()
Gender: In the graph, the total number of genders is roughly equal, but there are outlier values for gender in the NObeyesdad categories 5 and 6. The dataset lacks sufficient representation of individuals with NObeyesdad levels 5-6.
Within this dataset, females generally exhibit better obesity outcomes in the value range from 0 to 4.
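Because the two genders are not perfectly balanced, row-normalized proportions compare them more fairly; `pd.crosstab(..., normalize='index')` does this directly. A sketch on a hypothetical mini-sample:

```python
import pandas as pd

# hypothetical mini-sample: Gender 0=female/1=male, NObeyesdad encoded as above
gender = pd.Series([0, 0, 0, 1, 1, 1], name='Gender')
level = pd.Series([1, 1, 4, 4, 4, 1], name='NObeyesdad')
share = pd.crosstab(gender, level, normalize='index')  # each row sums to 1
print(share.loc[0, 1])  # fraction of females at Normal_Weight
```

On the real data, the same call on `df['Gender']` and `df['NObeyesdad']` would give per-gender obesity-level proportions regardless of group sizes.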
Age vs Obesity
# make the age groups
# reference from the lecture in class
df['Age_range']=pd.cut(df["Age"],[14,20,30,40,50,60],labels=['Teen','Young','MidAge','Senior','Elder'])
gender_age=pd.crosstab([df["Gender"],df["Age_range"]],df['NObeyesdad'])
gender_age.plot(kind='barh', stacked=True)
plt.title("Age, Gender, Obesity")
plt.show()
# Age box plot
import seaborn as sbn
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
ax=sbn.boxplot(x='NObeyesdad',y='Age',data=df,palette='Set3',hue='NObeyesdad')
mean_age=df.groupby('NObeyesdad',as_index=False).mean(numeric_only=True) # numeric_only avoids the FutureWarning
sbn.pointplot(x='NObeyesdad', y='Age', data=mean_age, ax=ax)
# violin plot of age, split by gender
plt.subplot(1,2,2)
az=sbn.violinplot(x='NObeyesdad',y='Age',data=df,palette='Set3',hue='Gender',split=True,inner="quart")
sbn.pointplot(x='NObeyesdad', y='Age', data=mean_age, ax=az)
plt.show()
In the violin plot, we integrate gender and age, revealing that older individuals tend to have a higher percentage of obesity risk. Additionally, datasets with a wide age range often exhibit gender imbalances, which can result in inaccuracies when predicting obesity rates among elderly individuals.
Correlation Matrix
# habits and family factors
columns_selected=['family_history_with_overweight','FAVC','FCVC','NCP','CAEC',"SMOKE",'CH2O',"CALC",'FAF','TUE','NObeyesdad','Weight']
matrix_df=df[columns_selected]
sbn.set(style='darkgrid') # reference from the class code; sbn.set returns None, so don't assign it to cmap
f,ax=plt.subplots(figsize=(12,12))
sbn.heatmap(matrix_df.corr(),annot=True)
f.tight_layout()
Principal Component Analysis
From the images above, we can see that some attributes are similar and some are more representative than others. Let's use PCA to investigate.
# PCA using
# reference from lecture
from sklearn.decomposition import PCA
df["Target"]=df["NObeyesdad"]
# using PCA to find the best 2 dimensions of this data
x=df.drop(columns="Age_range")
y=df["Target"]
pca=PCA(n_components=10)
pca.fit_transform(x)
print(pca.components_) # find the PCA components
# reference from lecture
# this is a scree plot
def plot_explained_variance(pca):
import plotly
from plotly.graph_objs import Bar, Line
from plotly.graph_objs import Scatter, Layout
from plotly.graph_objs.scatter import Marker
from plotly.graph_objs.layout import XAxis, YAxis
plotly.offline.init_notebook_mode() # run at the start of every notebook
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plotly.offline.iplot({
"data": [Bar(y=explained_var, name='individual explained variance'),
Scatter(y=cum_var_exp, name='cumulative explained variance')
],
"layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
})
plot_explained_variance(pca)
[10 x 18 matrix of principal-component loadings (output abridged): the first component is dominated by Weight (loading ≈ 0.99) and the second by Age (loading ≈ 0.99); the remaining components mix the habit attributes.]
Based on the correlation matrix, it is evident that most of the habit attributes correlate positively with obesity. Among these, FAVC (frequent consumption of high-caloric food) emerges as the primary factor. Conversely, CAEC (consumption of food between meals) exhibits a negative correlation with obesity. This could be due to various reasons; for instance, individuals may consume healthy foods between meals, which would not significantly increase obesity risk.
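As a minimal sketch, the correlations above can be reproduced by integer-encoding the categorical columns before calling `.corr()`. The tiny DataFrame below is a synthetic stand-in for the lab's `df` (only FAVC, CAEC, and the target are included), and note that `cat.codes` orders categories alphabetically, not ordinally, so the signs should be interpreted with care:

```python
import pandas as pd

# Synthetic stand-in for the lab's df; the real dataset has 2111 rows
df = pd.DataFrame({
    "FAVC": ["yes", "no", "yes", "yes", "no", "yes"],
    "CAEC": ["Sometimes", "Frequently", "no", "Sometimes", "Always", "no"],
    "NObeyesdad": ["Obesity_Type_I", "Normal_Weight", "Obesity_Type_II",
                   "Overweight_Level_I", "Normal_Weight", "Obesity_Type_III"],
})

# Encode each categorical column as integer codes so .corr() can be applied
encoded = df.apply(lambda col: col.astype("category").cat.codes)
corr_with_target = encoded.corr()["NObeyesdad"].drop("NObeyesdad")
print(corr_with_target)
```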
It is important to note that family-related factors, although not habit elements, can heavily influence obesity outcomes. Some attributes in this dataset lack sufficient volume to reflect a specific feature, such as the method of transportation, so we can disregard those correlations.
Question 1: Are there significant differences in obesity rates based on gender and age groups?
Answer: Yes, analysis of the dataset reveals notable disparities in obesity rates across different age and gender categories. Generally, younger individuals exhibit lower likelihoods of obesity, while females tend to have lower obesity prevalence compared to males.
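The group comparison behind this answer can be sketched with a `groupby` over gender and binned age. The mini-sample below is hypothetical; the real dataset supplies the `Gender`, `Age`, and `NObeyesdad` columns, and the age bins are an illustrative choice:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the survey data
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Age": [18, 22, 45, 51, 30, 16],
    "NObeyesdad": ["Obesity_Type_I", "Normal_Weight", "Obesity_Type_II",
                   "Normal_Weight", "Overweight_Level_I", "Normal_Weight"],
})

# Flag the obese classes, bin ages, then compute the obesity rate per group
df["is_obese"] = df["NObeyesdad"].str.startswith("Obesity")
df["Age_range"] = pd.cut(df["Age"], bins=[14, 25, 40, 65],
                         labels=["14-25", "26-40", "41-65"])
rates = df.groupby(["Gender", "Age_range"], observed=True)["is_obese"].mean()
print(rates)
```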
Question 2: Can we identify distinct segments of individuals based on their behaviors related to obesity?
Answer: Indeed, the dataset provides various attributes related to behaviors, allowing for quantitative analysis. Utilizing correlation matrices, we observe diverse associations with obesity. For instance, attributes like FAVC and FCVC exhibit positive correlations with obesity, while CAEC shows a negative correlation. Moreover, certain behavioral factors, such as smoking habits, may not strongly influence obesity outcomes.
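One way to surface such segments quantitatively is to cluster the encoded behaviour columns. This sketch uses k-means on synthetic data standing in for scaled behaviour attributes (e.g. FAVC, FCVC, CAEC, FAF); the two Gaussian blobs and the choice of two clusters are assumptions for illustration, not the lab's actual segmentation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four encoded behaviour columns: two distinct groups
rng = np.random.default_rng(42)
behaviour = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 4)),  # e.g. low-risk habits
    rng.normal(loc=2.0, scale=0.3, size=(50, 4)),  # e.g. high-risk habits
])

# Standardize, then partition into two behaviour segments
X = StandardScaler().fit_transform(behaviour)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # roughly a 50/50 split into two segments
```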
Question 3: How many principal components explain most of the variance?
Answer: Based on the PCA plots, two components contribute far more than the others; the first two components account for more than 99% of the explained variance.
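The ">99% in two components" check comes from `explained_variance_ratio_`. The sketch below uses synthetic data where two directions dominate the variance (mimicking large-scale columns such as Weight and Age next to small encoded categoricals), so the same pattern appears:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: two high-variance columns next to five low-variance ones
rng = np.random.default_rng(0)
X = np.hstack([
    rng.normal(scale=50.0, size=(500, 1)),  # dominant direction (e.g. Weight)
    rng.normal(scale=20.0, size=(500, 1)),  # second direction (e.g. Age)
    rng.normal(scale=0.5, size=(500, 5)),   # small encoded attributes
])

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
print(ratios[:2].sum())  # the first two components capture nearly everything
```

This also illustrates why scale-sensitive PCA on unstandardized data tends to be dominated by a few large-valued columns.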
4. Exceptional Work and Dimensionality Reduction
The overall coherence of this report and its conclusions is evident. We conducted thorough visualizations to address the questions posed, ensuring a comprehensive exploration of the dataset. In addition, we diligently cited our sources and provided annotations to support our findings.
We made several assumptions to delve into the main elements and attributes within the dataset, which allowed us to generate subsequent questions for further exploration and analysis. This approach facilitated a robust understanding of the data and enabled us to draw meaningful conclusions.
Before applying UMAP, we can see what several attributes look like in the pairplot below:
# using the UMAP
# reference from https://github.com/lmcinnes/umap
# reference from https://umap-learn.readthedocs.io/en/latest/plotting.html
# reference from https://pair-code.github.io/understanding-umap/
# reference from https://umap-learn.readthedocs.io/en/latest/basic_usage.html
import umap
import umap.plot
# choose several columns
select_columns = ['Age','NObeyesdad','Weight','FAVC','FCVC','FAF','TUE','family_history_with_overweight','NCP','CALC','Gender']
x = df[select_columns]  # Features
sbn.set_theme(style='white', context='notebook', rc={'figure.figsize':(14,10)})
sbn.pairplot(x, hue='NObeyesdad')
From this pairplot, we can see that some attributes are more representative than others, such as Weight.
y = df["NObeyesdad"] # Target labels
# embedding = umap.UMAP().fit(x.values)  # fit only, without transforming
embedding = umap.UMAP(n_components=2, n_neighbors=80).fit_transform(df.drop(columns="Age_range"), y=df["NObeyesdad"])  # supervised: pass y so UMAP uses the target to separate the classes
# default parameters: n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean' https://umap-learn.readthedocs.io/en/latest/parameters.html
embedding_untarget = umap.UMAP(n_components=2, n_neighbors=80).fit_transform(df.drop(columns="Age_range"))  # unsupervised: no target provided
plt.figure(figsize=(10, 8))
plt.scatter(embedding_untarget[:, 0], embedding_untarget[:, 1], c=y, cmap='viridis', s=10)
plt.colorbar(label='NObeyesdad', ticks=range(10))
plt.title('UMAP Visualization without target')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.figure(figsize=(10, 8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='viridis', s=10)
plt.colorbar(label='NObeyesdad', ticks=range(10))
plt.title('UMAP Visualization with target')
plt.show()
Observations based on the UMAP plot: